Distributed health check for global server load balancing

ABSTRACT

A global server load-balancing (GSLB) switch serves as a proxy to an authoritative DNS and communicates with numerous site switches that are coupled to host servers serving specific applications. The GSLB switch receives from site switches operational information regarding host servers within the site switches' neighborhood. This operational information includes health check information that is remotely obtained in a distributed manner from remote metric agents at the site switches. When a client program requests a resolution of a host name, the GSLB switch, acting as a proxy of an authoritative DNS, returns one or more ordered IP addresses for the host name. The IP addresses are ordered using metrics, including the health check metric that evaluates these IP addresses based on the health check information communicated to the GSLB switch in a distributed manner by the distributed health check site switches. In one instance, the GSLB switch places the address that is deemed “best” at the top of the list.

CROSS-REFERENCE TO RELATED APPLICATION

The present application is a continuation-in-part of U.S. application Ser. No. 09/670,487, entitled “GLOBAL SERVER LOAD BALANCING,” filed Sep. 26, 2000, assigned to the same assignee as the present application, and which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This disclosure relates generally to load balancing among servers. More particularly but not exclusively, the present disclosure relates to achieving load balancing by, in response to resolving a DNS query by a client, providing the address of a server that is expected to serve the client with a high performance in a given application, based at least in part on remotely obtained health check information.

2. Description of the Related Art

Under the TCP/IP protocol, when a client provides a symbolic name (“URL”) to request access to an application program or another type of resource, the host name portion of the URL needs to be resolved into an IP address of a server for that application program or resource. For example, the URL (e.g., http://www.foundrynet.com/index.htm) includes a host name portion www.foundrynet.com that needs to be resolved into an IP address. The host name portion is first provided by the client to a local name resolver, which then queries a local DNS server to obtain a corresponding IP address. If a corresponding IP address is not locally cached at the time of the query, or if the “time-to-live” (TTL) of a corresponding IP address cached locally has expired, the DNS server then acts as a resolver and dispatches a recursive query to another DNS server. This process is repeated until an authoritative DNS server for the domain (e.g., foundrynet.com, in this example) is reached. The authoritative DNS server returns one or more IP addresses, each corresponding to an address at which a server hosting the application (“host server”) under the host name can be reached. These IP addresses are propagated back via the local DNS server to the original resolver. The application at the client then uses one of the IP addresses to establish a TCP connection with the corresponding host server. Each DNS server caches the list of IP addresses received from the authoritative DNS for responding to future queries regarding the same host name, until the TTL of the IP addresses expires.

To provide some load sharing among the host servers, many authoritative DNS servers use a simple round-robin algorithm to rotate the IP addresses in a list of responsive IP addresses, so as to distribute equally the requests for access among the host servers.
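As a non-limiting illustration of the round-robin rotation described above, the following minimal sketch rotates a list of responsive addresses by one position per query. The function name and the example addresses (taken from a documentation address range) are assumptions made for illustration only and do not describe any particular DNS server implementation.

```python
from collections import deque


def round_robin_responses(addresses, queries):
    """Yield one copy of the address list per query, rotated each time."""
    ring = deque(addresses)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # move the first address to the end of the list


responses = list(
    round_robin_responses(["192.0.2.1", "192.0.2.2", "192.0.2.3"], 4)
)
```

Each successive response begins with a different address, so clients that pick the first address are spread across the host servers regardless of whether a given server is actually reachable.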

The conventional method described above for resolving a host name to its IP addresses has several shortcomings. For instance, the authoritative DNS does not detect a server that is down. Consequently, the authoritative DNS server continues to return a disabled host server's IP address until an external agent updates the authoritative DNS server's resource records. Further, the conventional DNS algorithm allows invalid IP addresses (e.g., that corresponding to a downed server) to persist in a local DNS server until the TTL for the invalid IP address expires.

SUMMARY OF THE INVENTION

One aspect of the present invention provides a system to balance load among host servers. The system includes an authoritative domain name server, and a load balance switch coupled to the authoritative domain name server as a proxy to the authoritative domain name server. A plurality of site switches are communicatively coupled to the load balance switch and remote from the load balance switch. At least one of the site switches can obtain health check information indicative of health status of ports associated with host servers for that site switch and can provide the obtained health check information to the load balance switch, to allow the load balance switch to arrange a list of network addresses from the authoritative domain name server based at least in part on the health check information provided by the site switch.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a global server load-balancing configuration in which one embodiment of the invention may be implemented.

FIG. 2 illustrates in a flow chart an embodiment of a technique to perform distributed health checks for the configuration of FIG. 1.

FIG. 3 is a block diagram showing the functional modules of a GSLB switch and a site switch relevant to distributed health checking for the global server load balancing function in accordance with one embodiment of the invention.

DETAILED DESCRIPTION

Embodiments for global server load-balancing techniques that are based at least in part on distributed health check information are described herein. In the following description, numerous specific details are given to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

As an overview, an embodiment of the invention provides a global server load-balancing (GSLB) switch that serves as a proxy to an authoritative DNS and that communicates with numerous site switches coupled to host servers serving specific applications. The GSLB switch receives, from the site switches, operational information regarding host servers being load balanced by the site switches. When a client program requests a resolution of a host name, the GSLB switch, acting as a proxy of an authoritative DNS, returns one or more ordered IP addresses for the host name. The IP addresses are ordered using metrics that include the information collected from the site switches. In one instance, the GSLB switch places the address that is deemed “best” at the top of the list.

One of these metrics includes health check information, which is indicative of the host servers' health. In the prior-filed U.S. application Ser. No. 09/670,487, entitled “GLOBAL SERVER LOAD BALANCING,” filed Sep. 26, 2000 and U.S. application Ser. No. 10/206,580, entitled “GLOBAL SERVER LOAD BALANCING,” filed Jul. 25, 2002, embodiments were disclosed where the GSLB switch carried out health checks in a “centralized manner.” That is, to determine the health of the servers and/or the health of the host application(s) on the servers, the GSLB switch sends Layer 4 transmission control protocol (TCP) or User Datagram Protocol (UDP) health checks to the servers. Layer 3 and Layer 7 health checks can also be sent. If a server fails one of these health checks, it is disqualified from being the “best” IP address.

In contrast to the centralized health check, an embodiment of the present invention performs distributed health checks, where the health-checking tasks are distributed to the peer metric agents at the site switches, instead of being performed by the GSLB switch. The health checking may thus be performed independently of a request from the GSLB switch, in contrast to the centralized health check implementation where the health check information is conveyed in response to a request from the GSLB switch. The distributed health checking allows for reduction in GSLB processing load, reduction in health-check traffic, and increased scalability due to the distribution. Each metric agent generates a health status report, and provides this report to the GSLB switch (such as via part of a protocol message in one embodiment). On receiving the health status report, the GSLB switch processes the health check information therein, updates its records accordingly, and uses the health information to evaluate or modify the DNS response. The health check information may be indicative of access conditions to host servers (including host servers associated with a particular site switch, or with host servers that are not associated with a particular site switch, if that site switch operates as a type of information collector, for instance), and/or the health check information may be indicative of access conditions to an application hosted on a host server or access condition to some other component for which a particular site switch collects health check information.

An embodiment of the invention also allows integration of distributed health check components in systems that also include non-distributed health check components (e.g., centralized health check components). For example, a system described herein includes a GSLB switch and at least one remote metric agent that both support distributed health checks. Embodiments of the distributed health check can also provide compatibility between a remote metric agent that supports distributed health checks and a GSLB switch that does not, or compatibility between a GSLB switch that supports distributed health checks and a remote agent that does not. In situations where both a GSLB switch and a remote agent do not support distributed health checks, a centralized health check (such as disclosed in the co-pending applications identified above) can be implemented. This compatibility allows interoperability, installation, and transition of the distributed health check components into current systems that are based on centralized health checks.

FIG. 1 illustrates an example global server load-balancing configuration in which one embodiment of the invention may be implemented. As shown in FIG. 1, global server load balancing (GSLB) switch 12 is connected to Internet 14 and acts as a proxy to an authoritative Domain Name System (DNS) server 16 for the domain “foundrynet.com” (for example). That is, while the actual DNS service is provided by DNS server 16, the IP address known to the rest of the Internet for the authoritative DNS server of the domain “foundrynet.com” is a virtual IP (VIP) address configured on GSLB switch 12. Of course, DNS server 16 can also act simultaneously as an authoritative DNS for other domains. GSLB switch 12 communicates, via Internet 14, with site switches 18A and 18B at site 20, site switches 22A and 22B at site 24, and any other similarly configured site switches. Site switches 18A, 18B, 22A and 22B are shown, for example, connected to routers 19 and 21 respectively and to servers 26A, . . . , 26I, . . . , 26N. Some or all of servers 26A, . . . , 26I, . . . , 26N may host application server programs (e.g., http and ftp) relevant to the present invention. These host servers are reached through site switches 18A, 18B, 22A and 22B using one or more virtual IP addresses configured at the site switches, which act as proxies to the host servers. A suitable switch for implementing either GSLB switch 12 or any of site switches 18A, 18B, 22A and 22B is the “ServerIron” product available from Foundry Networks, Inc. of San Jose, Calif.

FIG. 1 also shows client program 28, which is connected to Internet 14 and communicates with local DNS server 30. When a browser on client program 28 requests a web page, for example, using a Uniform Resource Locator (URL), such as http://www.foundrynet.com/index.htm, a query is sent to local DNS server 30 to resolve the symbolic host name www.foundrynet.com to an IP address of a host server. The client program 28 receives from DNS server 30 a list of IP addresses corresponding to the resolved host name. This list of IP addresses is either retrieved from the cache of local DNS server 30, if the TTL of the responsive IP addresses in the cache has not expired, or obtained from GSLB switch 12, as a result of a recursive query. Unlike the prior art, however, this list of IP addresses is re-ordered in one embodiment by GSLB switch 12 based on performance metrics described in further detail below, one of which is associated with distributed health check information.

In the remainder of this detailed description, for the purpose of illustrating embodiments of the present invention only, the list of IP addresses returned are assumed to be the virtual IP addresses configured on the proxy servers at switches 18A, 18B, 22A and 22B (sites 20 and 24). In one embodiment, GSLB switch 12 determines which site switch would provide the best expected performance (e.g., response time) for client program 28 and returns the IP address list with a virtual IP address configured at that site switch placed at the top. (Within the scope of the present invention, other forms of ranking or weighting the IP addresses in the list can also be possible.) Client program 28 can receive the ordered list of IP addresses, and typically selects the first IP address on the list to access the corresponding host server.

FIG. 3 is a block diagram showing the functional modules of GSLB switch 12 and site switch 18A relevant to the global server load balancing function. For purposes of illustration, the site switch 18A is chosen—it is appreciated that the discussion herein can be appropriately applied to any of the other site switches depicted in FIG. 1. As shown in FIG. 3, GSLB switch 12 includes a GSLB switch controller 401, DNS proxy module 403, metric agent 404, routing metric collector 405, and metric collector 406. GSLB switch controller 401 provides general control functions for the operation of GSLB switch 12. The metric collector 406 communicates with metric agents in site switches (e.g., FIG. 3 shows metric collector 406 communicating with a remote metric agent 407 of a site server load balancing ServerIron or “SLB SI”) to collect switch-specific metrics from each of these switches, which in one embodiment includes health check information.

At the site switch 18A, the remote metric agent 407 is communicatively coupled to a health check module 402. The health check module 402, in a distributed health check embodiment, is responsible for querying host servers and relevant applications hosted on the host servers being load balanced by the site switch 18A to determine the “health” of each host server and each relevant application. In one embodiment, the health information includes a list of VIPs configured at the remote site 18A (e.g., at that SLB SI) and whether the ports associated with these VIPs are up or down. Once this health information is obtained by the health check module 402 (which may be implemented as a software module), the health information is communicated to the remote metric agent 407, which then sends the health information to the metric collector 406 via a protocol message and in a manner that will be described later below.

In a centralized health check embodiment, such as described in the co-pending applications identified above, the health check module 402 is located at the GSLB switch 12, rather than at the site switch 18A. In this implementation, the health check module 402 communicates directly with the GSLB switch controller 401, rather than via protocol messages. Similarly, the local metric agent 404 can communicate health check information to the GSLB switch controller 401 directly, without using the protocol communication.

Routing metric collector 405 collects routing information from routers (e.g., topological distances between nodes on the Internet). FIG. 3 shows, for example, router 408 providing routing metric collector 405 with routing metrics (e.g., topological distance between the load balancing switch and the router), using the Border Gateway Protocol (BGP). DNS proxy module 403 (a) receives incoming DNS requests, (b) provides the host names to be resolved to DNS server 16, (c) receives from DNS server 16 a list of responsive IP addresses, (d) orders the IP addresses on the list received from DNS server 16 according to an embodiment of the present invention, using the metrics collected by routing-metric collector 405 and metric collector 406, and values of any other relevant parameter, and (e) provides the ordered list of IP addresses to the requesting DNS server. It is appreciated that the GSLB switch controller 401 may alternatively or in addition perform the IP address-ordering based on the metrics. Since GSLB switch 12 can also act as a site switch, GSLB switch 12 is provided a local metric agent 404 for collecting metrics. Similar to that in the centralized health check embodiment, the local metric agent 404 communicates health check information to the GSLB switch controller 401 directly, without using the protocol communications of the distributed health check embodiment.

In one embodiment, the metrics used in a GSLB switch 12 include (a) the health of each host server and selected applications, (b) each site switch's session capacity threshold, (c) the round trip time (RTT) between a site switch and a client in a previous access, (d) the geographical location of a host server, (e) the connection-load measure of new connections-per-second at a site switch, (f) the current available session capacity in each site switch, (g) the “flashback” speed between each site switch and the GSLB switch (i.e., how quickly each site switch responds to a health check from the GSLB switch), for implementations that perform centralized health checks rather than distributed health checks, and (h) a policy called the “Least Response Selection” (LRS) which prefers the site switch that has been selected less often than others.

Many of these performance metrics can be provided default values. The order in which these performance metrics can be used to evaluate the IP addresses in the DNS reply can be modified as required. Each metric can be selectively disabled or enabled, such as in systems that include components that support or do not support distributed health checks. Further details of these metrics and how they are used in an example algorithm to re-order an address list to identify the “best” IP address are disclosed in the co-pending applications identified above. For purposes of the present application, such specific details regarding the metrics and their use in the algorithm are omitted herein, so as to instead focus on the techniques to acquire and communicate distributed health check information.

FIG. 2 illustrates in a flow chart 200 an embodiment of a technique to perform distributed health checks for the configuration of FIG. 1. At least some of the elements of the flow chart 200 can be embodied in software or other machine-readable instruction stored on one or more machine-readable storage media. For example, such software to perform operations depicted in the flow chart 200 may be present at the remote site (e.g., the site switch 18A) in one embodiment. Moreover, it is appreciated that the various depicted operations need not necessarily occur in the exact order or sequence as shown.

At a block 210 periodic or asynchronous updates related to health check information may be performed. The updates at the block 210 will be described later below, and updates may be performed and/or communicated at any suitable location in the flow chart 200. At a block 202, health check information is collected at a remote site switch (e.g., the site switch 18A) that supports or is otherwise configured for distributed health checking. In one embodiment, this involves having the remote metric agent 407 cooperate with the health check module 402 to check the status (e.g., up or down) of the virtual ports of the VIPs at the site switch 18A. This could entail determining if at least one of the real ports associated with the virtual port of a VIP is healthy. For example, the health check module 402 can “ping” the real ports associated with a virtual port of a VIP to determine if they respond. If it finds at least one such responsive real port, it concludes that the virtual port of the VIP is healthy.
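As a non-limiting illustration, the rule that a virtual port is deemed healthy when at least one of its associated real ports responds can be sketched as follows. The names `RealPort`, `virtual_port_is_healthy`, and `fake_probe`, as well as the example addresses, are hypothetical; the probe is passed in as a callable because the actual probing mechanism (e.g., a “ping” or a Layer 4 or Layer 7 check) depends on the embodiment.

```python
from typing import Callable, Iterable, Tuple

RealPort = Tuple[str, int]  # (real server address, port number)


def virtual_port_is_healthy(
    real_ports: Iterable[RealPort],
    probe: Callable[[RealPort], bool],
) -> bool:
    """Return True if at least one real port behind the virtual port responds."""
    return any(probe(real_port) for real_port in real_ports)


# Hypothetical stand-in for a real probe: only one real server answers.
responsive = {("10.0.0.2", 80)}


def fake_probe(real_port: RealPort) -> bool:
    return real_port in responsive


vip_port_up = virtual_port_is_healthy(
    [("10.0.0.1", 80), ("10.0.0.2", 80)], fake_probe
)
vip_port_down = virtual_port_is_healthy([("10.0.0.3", 80)], fake_probe)
```

Because `any` stops at the first responsive real port, the sketch mirrors the described behavior of concluding that the virtual port is healthy as soon as one such port is found.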

It is noted that in one embodiment of the centralized health check system, the health check module 402 is located at the GSLB switch 12, and sends health check queries to the remote metric agent 407. The remote metric agent 407 treats this health check query like a normal request, and load balances the request among the real servers behind the site switch 18A. The health check information is returned to the GSLB switch 12 by the remote metric agent 407, and the health check information indicates the health status of the VIP port(s) of the site switch 18A. In contrast, in the distributed health check system, the remote metric agent 407 and the health check module 402 cooperate at the block 202 to obtain the health status of the real ports mapped under the VIP ports.

It is also noted that in the centralized health check system, each health check query from the GSLB switch 12 to the site switch 18A is an individual TCP connection, in one embodiment. Thus, a separate TCP connection needs to be established to check the health status of each and every port. Furthermore, the TCP connection needs to be established and torn down each time the health check information needs to be updated at the GSLB switch 12. In one embodiment of the centralized health check, the frequency of updating the health check information may be once every 5 seconds. These multiple TCP connections use up bandwidth and require more processing. Therefore, as will be explained later in the flow chart 200, an embodiment of the distributed health check can provide the complete health status for ports (real or VIP) and hosted applications via inclusion into a protocol message carried by a single TCP connection that is established initially when the metric collector 406 initiates communication with the remote metric agent 407. This connection is maintained in an exchange of keep-alive messages between the metric collector 406 and the remote metric agent 407. This provides a savings in speed, time, and bandwidth utilization.
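As a rough, non-limiting illustration of the difference in connection overhead, the following sketch counts TCP connection setups per minute in the centralized case (one connection per port per update) and compares it with the single persistent connection per remote metric agent in the distributed case. The port count of 100 is a hypothetical figure chosen only for the arithmetic; the 5-second interval is the example value given above.

```python
def centralized_connections_per_minute(ports, interval_s=5):
    """One TCP connection is set up and torn down per port per update."""
    return ports * 60 / interval_s


def distributed_persistent_connections(agents):
    """One persistent TCP connection per remote metric agent."""
    return agents


# Hypothetical site with 100 checked ports behind one remote metric agent.
centralized = centralized_connections_per_minute(ports=100)
distributed = distributed_persistent_connections(agents=1)
```

Under these assumed numbers, the centralized approach sets up 1200 connections per minute for the site, whereas the distributed approach reuses one connection that is kept open by keep-alive messages.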

At a block 204, the remote metric agent 407 generates an address list (identifying the addresses configured on the site switch 18A) and the health status of the ports corresponding to these addresses. In an embodiment, the address list and port status can correspond to the VIP addresses and VIP ports. Whether a port is up or down can be respectively indicated by a binary 1 or 0, or vice versa. It is appreciated that other types of health information, in addition to the address list and port status, can be generated at the block 204, including health status of hosted applications (e.g., whether an application hosted on a real server is available or unavailable).
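As a non-limiting illustration, the address list and per-port status generated at the block 204 might be represented as follows, with 1 indicating that a port is up and 0 indicating that it is down (the opposite convention is equally possible). The `VipStatus` structure, the function name, and the example addresses are assumptions made for illustration and do not reflect an actual message layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VipStatus:
    address: str                  # VIP address configured on the site switch
    port_status: Dict[int, int]   # port number -> 1 (up) or 0 (down)


def build_health_report(vip_ports: Dict[str, Dict[int, bool]]) -> List[VipStatus]:
    """Convert per-VIP boolean port states into binary status entries."""
    return [
        VipStatus(address, {port: int(up) for port, up in ports.items()})
        for address, ports in vip_ports.items()
    ]


report = build_health_report(
    {
        "192.0.2.10": {80: True, 21: False},
        "192.0.2.11": {80: True},
    }
)
```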

At a block 206, the health information is communicated by the remote metric agent 407 to the metric collector 406 of the GSLB switch 12. In one embodiment, the health check information (e.g., address list and port status) is communicated to the GSLB switch 12 as a message forming part of a protocol communication. For instance, FIG. 3 labels this communication as “Foundry GSLB Protocol,” which will be described herein next in the context of communicating health check information. It is appreciated that the Foundry GSLB Protocol is merely intended herein to illustrate an example technique to convey the distributed health check information, and that other embodiments may use different types of communication techniques to convey the distributed health check information.

The Foundry GSLB Protocol is used for communication between the metric collector 406 residing on the GSLB switch 12 and the remote metric agent 407 at the site switch 18A. A communication using this protocol can be established with a single TCP connection that remains persistent/active, without the need to re-establish a new TCP connection each time a message is to be conveyed, in one embodiment. The protocol communication includes a plurality of message types, which are listed below as non-exhaustive examples:

1. OPEN

2. ADDRESS LIST

3. REQUEST

4. RESPONSE

5. REPORT

6. SET PARAMETERS

7. NOTIFICATION

8. KEEP ALIVE

9. CLOSE

10. RTT TRAFFIC

11. OPAQUE

12. ADDRESS LIST DISTRIBUTED (DIST)

13. SET PARAMETERS DIST

14. OPEN DIST

The last three message types (12, 13, and 14) are usable with distributed health checking, while the other message types may be used either with centralized health checking or distributed health checking.
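As a non-limiting illustration, the message types listed above can be captured in an enumeration. The numeric values below simply mirror the positions in the list and are an assumption; they are not asserted to be the codes used in any actual protocol encoding.

```python
from enum import IntEnum


class MessageType(IntEnum):
    OPEN = 1
    ADDRESS_LIST = 2
    REQUEST = 3
    RESPONSE = 4
    REPORT = 5
    SET_PARAMETERS = 6
    NOTIFICATION = 7
    KEEP_ALIVE = 8
    CLOSE = 9
    RTT_TRAFFIC = 10
    OPAQUE = 11
    ADDRESS_LIST_DIST = 12
    SET_PARAMETERS_DIST = 13
    OPEN_DIST = 14


# Message types that are usable with distributed health checking only.
DISTRIBUTED_ONLY = frozenset(
    {
        MessageType.ADDRESS_LIST_DIST,
        MessageType.SET_PARAMETERS_DIST,
        MessageType.OPEN_DIST,
    }
)


def is_distributed_only(message_type: MessageType) -> bool:
    return message_type in DISTRIBUTED_ONLY
```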

The TCP connection is established by the metric collector 406 under instruction of the switch controller 401. The metric collector 406 attempts to open a persistent communication with all specified remote metric agents 407. Where remote metric agents 407 support distributed health checks, the metric collector 406 uses the “OPEN DIST” message type to initiate and establish a TCP connection that would be used for communication of health check and other relevant information between these two entities.

When conveying the health check information, the message under the protocol (sent from the remote metric agent 407 to the metric collector 406) is under the message type “ADDRESS LIST DIST.” The ADDRESS LIST DIST message includes a list of the addresses and the health status of the corresponding ports. If ports or addresses are removed or added at the site switch 18A, such updated data is also sent along with the ADDRESS LIST DIST message.

The “SET PARAMETERS” and “SET PARAMETERS DIST” message types are sent by the metric collector 406 to the remote metric agent 407. These message types are used to change protocol parameters at the remote metric agent 407. In the distributed health check model, if the metric collector 406 supports distributed health checks but the remote metric agent 407 does not (e.g., is configured for centralized health check), then the metric collector 406 sends the message with SET PARAMETERS message type to the remote metric agent 407 to ensure that the subsequent message format(s) conforms to that used for centralized health checking. The SET PARAMETERS DIST message type is used when both the metric collector 406 and the remote metric agent 407 support distributed health checking.

At a block 208, the GSLB switch 12 receives the health check information and processes it. More specifically, the metric collector 406 receives the health check information that is sent in a protocol message from the remote metric agent 407, and processes this information.

At the block 208, the GSLB switch 12 (in particular the metric collector 406) may also update databases or other stored records/data to reflect the information indicated in the health check information. For example, if new ports or addresses or hosted applications have been added (or removed) at the remote site switch 18A, the stored records at the GSLB switch 12 can be updated to add (or remove) entries relevant to those ports, addresses, and applications, such as their specific numerical address and their health status. Alternatively or in addition, the stored data can be updated to indicate the current health status of any existing address, port, or application.
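As a non-limiting illustration, the record update at the block 208 can be sketched as replacing the stored entries for a site with the latest reported state while noting which entries were added or removed. The dictionary layout, the function name `apply_report`, and the site label are hypothetical.

```python
def apply_report(records, site, report):
    """Store the latest report for a site and return (added, removed) keys.

    records: dict mapping site -> {(address, port): status}
    report:  dict mapping (address, port) -> status (1 = up, 0 = down)
    """
    previous = records.get(site, {})
    added = set(report) - set(previous)
    removed = set(previous) - set(report)
    records[site] = dict(report)  # also refreshes status of existing entries
    return added, removed


records = {"site-18A": {("192.0.2.10", 80): 1, ("192.0.2.10", 21): 1}}
added, removed = apply_report(
    records,
    "site-18A",
    {("192.0.2.10", 80): 0, ("192.0.2.11", 80): 1},
)
```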

The metric collector 406 makes this processed health check information and the database(s) mentioned above available to the switch controller 401. The switch controller 401 then uses this health check information as one of the metrics in the GSLB algorithm to determine which address to place at the top of the address list. The flashback metric is disabled for implementations that support distributed health checking, since the flashback metric is used to measure the time it takes for health check information to be returned to the GSLB switch 12. The re-ordered list is subsequently provided to the requesting client program 28.
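As a non-limiting illustration of how the health check metric alone might influence the ordering, the following sketch moves addresses that are not reported healthy to the end of the list while preserving the relative order otherwise. This is a simplification made for illustration: the GSLB algorithm referred to above applies several metrics, and the function name and addresses here are hypothetical.

```python
def order_by_health(addresses, healthy):
    """Place healthy addresses first; keep the original order within each group."""
    # sorted() is stable, and False sorts before True, so healthy
    # addresses (key False) stay ahead of unhealthy ones (key True).
    return sorted(addresses, key=lambda address: address not in healthy)


ordered = order_by_health(
    ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    healthy={"192.0.2.11", "192.0.2.12"},
)
```

With these inputs, the unhealthy address 192.0.2.10 can no longer appear at the top of the list, which corresponds to its being disqualified from being the “best” address.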

At a block 210, updated health check information is sent from the remote metric agent 407 to the GSLB switch 12. In one embodiment, these updates may be periodic and/or asynchronous updates. Periodic updates are sent at the block 210 periodically from the remote metric agent 407 to the metric collector to communicate to it the latest health information. In addition, asynchronous updates are also sent at the block 210 whenever there is a change in VIP or port configuration at the site switch 18A. In one embodiment, the interval between periodic health check messages is user-configurable, and can range between 2-120 seconds, for example. A default interval can be 5 seconds, for example.
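As a non-limiting illustration, the combination of periodic and asynchronous updates can be sketched as a simple decision: an update is sent when the configured interval has elapsed or when the VIP or port configuration has changed. The constant and function names are assumptions; the 2 to 120 second range and the 5-second default are the example values given above.

```python
MIN_INTERVAL_S = 2
MAX_INTERVAL_S = 120
DEFAULT_INTERVAL_S = 5


def resolve_interval(configured=None):
    """Return the reporting interval, falling back to the default."""
    if configured is None:
        return DEFAULT_INTERVAL_S
    if not MIN_INTERVAL_S <= configured <= MAX_INTERVAL_S:
        raise ValueError("interval must be between 2 and 120 seconds")
    return configured


def update_due(now, last_sent, interval, config_changed):
    """Asynchronous update on a configuration change, else periodic update."""
    return config_changed or (now - last_sent) >= interval
```

For example, with the default interval, `update_due(11.0, 6.0, resolve_interval(), False)` is true because 5 seconds have elapsed, while a configuration change triggers an update immediately regardless of the elapsed time.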

In an embodiment, the remote metric agent(s) 407 is responsible for periodically generating and sending health check information for all the VIPs configured at their respective site switch. The health check reporting interval can be configured globally on the switch controller 401 or locally on an individual remote metric agent 407. Command line interface (CLI) software commands may be used by one embodiment to specify the interval, at the GSLB switch 12 or at the remote site switches. If the reporting interval is configured on the switch controller 401, the interval is communicated to the distributed health check remote metric agents 407 via the SET PARAMETERS DIST message.

The various components of the flow chart 200 repeat or are otherwise performed continuously, as the remote site switch(es) continue to obtain and send health check information to the GSLB switch 12. The GSLB switch 12 responsively continues to examine and process the health check information so as to appropriately re-order the address list for the DNS reply.

The above-described embodiments relate to use of a remote metric agent 407 and the GSLB switch 12 that both support distributed health checks. For situations where neither of these components supports distributed health checks, a centralized health check technique (such as described in the co-pending applications) can be used.

Another situation is where the GSLB switch 12 supports distributed health checks, but at least one of the remote metric agents 407 with which it communicates does not support them. For such situations, the GSLB switch 12 can have installed therein (or otherwise be capable of enabling) its own health check module 402. The non-distributed health check remote metric agents 407 are pre-identified for this GSLB switch 12, so that its health check module 402 can send health checks to these non-distributed health check remote metric agents 407 in a centralized manner. In the protocol communication scheme, a persistent TCP connection to these non-distributed health check remote metric agents 407 initiated by the metric collector 406 uses a message type “OPEN” instead of “OPEN DIST,” for example.

Note that the other remote metric agents 407 that support distributed health check will generate the health check information as described earlier and communicate it to the metric collector 406. The health check module 402 of the GSLB switch 12 does not send any health checks for these distributed health check remote metric agents 407.

In the protocol communication, a connection to these distributed health check remote metric agents 407, initiated by the metric collector 406, uses the message type “OPEN DIST.”
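The mixed-capability case described above can be sketched as follows. This is an illustrative example only: the function names and the dictionary layout are hypothetical, and the only details taken from the description are the two message types and the rule that the health check module 402 sends health checks only for agents that do not support distributed health checks.

```python
from typing import Dict


def plan_connection(supports_distributed: bool) -> Dict[str, object]:
    """Decide how the metric collector opens a connection to one agent."""
    if supports_distributed:
        # The agent generates and reports health check information itself,
        # so the GSLB switch sends no health checks for it.
        return {"open_message": "OPEN DIST", "centralized_health_checks": False}
    # Pre-identified non-distributed agent: fall back to centralized checks.
    return {"open_message": "OPEN", "centralized_health_checks": True}


def plan_all(agents: Dict[str, bool]) -> Dict[str, Dict[str, object]]:
    """Map each agent name to its connection plan.

    `agents` maps a (hypothetical) agent name to whether that agent
    supports distributed health checks.
    """
    return {name: plan_connection(flag) for name, flag in agents.items()}
```

For example, `plan_all({"site-a": True, "site-b": False})` would open the connection to the first agent with “OPEN DIST” and to the second with “OPEN,” with centralized health checks enabled only for the second.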

The flashback metric is disabled, in an embodiment, for this situation where some remote metric agents support distributed health checks while some may not. It is advisable in some instances to enable the flashback metric (via CLI or other technique) only if the user is absolutely certain that none of the remote metric agents 407 support distributed health checks.

Yet another situation is where the GSLB switch 12 does not support distributed health checks, but at least one of the remote metric agents 407 with which it communicates does support them. The remote metric agent 407 can first detect this limitation of the GSLB switch 12, for instance, if the metric collector 406 of the GSLB switch 12 uses the message type “OPEN” when it first establishes a protocol communication with the remote metric agent 407. Alternatively or in addition, the non-distributed health check GSLB switch 12 can be pre-identified for the remote metric agent 407, or the remote metric agent 407 may detect this limitation if it explicitly receives a query for health check information from the GSLB switch 12. After identification of the non-distributed health check GSLB switch 12, the remote metric agent 407 can send its address list information to the GSLB switch 12 with a message type “ADDRESS LIST” (instead of “ADDRESS LIST DIST”) or other format compatible with a centralized health check implementation. Note that unlike the ADDRESS LIST DIST message sent by the distributed health check remote metric agent 407 to a distributed health check metric collector 406, the ADDRESS LIST message sent to a non-distributed health check metric collector 406 does not contain any health check information. In one embodiment of centralized health check, the ADDRESS LIST message merely serves the purpose of communicating the addresses configured on site switch 18A to the metric collector 406.
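The agent-side behavior just described can be illustrated with the following sketch. The payload layout (a dictionary with "type" and "addresses" keys, and a per-port up/down flag) is a hypothetical representation chosen for this example; the description specifies only the message type names and that the ADDRESS LIST message carries no health check information.

```python
from typing import Dict

# Maps a port number to True (up) or False (down); hypothetical layout.
PortStatus = Dict[int, bool]


def build_address_message(collector_open_type: str,
                          vips: Dict[str, PortStatus]) -> dict:
    """Build the address message a remote metric agent sends to a collector.

    `collector_open_type` is the message type the metric collector used when
    it first established the connection ("OPEN" or "OPEN DIST").
    """
    if collector_open_type == "OPEN DIST":
        # Distributed-capable collector: include per-port health status.
        return {
            "type": "ADDRESS LIST DIST",
            "addresses": {ip: dict(ports) for ip, ports in vips.items()},
        }
    # Non-distributed collector: communicate the configured addresses only.
    return {"type": "ADDRESS LIST", "addresses": sorted(vips)}
```

In this sketch the agent infers the capability of the GSLB switch solely from the open message type; pre-identification or an explicit health check query, as mentioned above, could feed the same decision.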

In one embodiment of an optimization algorithm utilized by GSLB switch 12 and executed by the switch controller 401 to process the IP address list received from DNS server 16, the health check metric is used as the first criterion to determine which IP address is “best” and to preliminarily place that IP address at the top of the list of IP addresses. Thereafter, other metrics may be used to perform additional re-ordering of the IP address list, such as a connection-load metric, FIT, flashback (for systems that include non-distributed health check components), and so forth. In one embodiment, the health check information, whether obtained by the distributed or the centralized technique, is considered at the same priority in the algorithm; only the process by which this health check information is obtained and communicated is different.
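As a simplified illustration of using the health check metric as the first criterion, the sketch below orders candidate addresses so that healthy addresses come first and a second metric breaks ties. The record fields, the use of a single connection-load value as the only follow-on metric, and the use of a lexicographic sort are assumptions for this example, not a statement of how the switch controller 401 is implemented.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    ip: str
    healthy: bool           # result of the health check metric
    connection_load: float  # hypothetical follow-on metric; lower is better


def order_addresses(candidates: List[Candidate]) -> List[str]:
    """Return IP addresses ordered with the "best" address first.

    Health is the first criterion (healthy before unhealthy); connection
    load then re-orders addresses that tie on health.
    """
    ranked = sorted(candidates, key=lambda c: (not c.healthy, c.connection_load))
    return [c.ip for c in ranked]
```

Because the ordering depends only on the `healthy` flag and not on how it was obtained, the same function applies whether the health check information came from the distributed or the centralized technique.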

In systems that include both distributed and non-distributed health check components, the flashback metric can be selectively enabled or disabled. In an embodiment in which all of the components are non-distributed health check components, the flashback metric is enabled and placed in the algorithm just prior to the least response metric when considering a list of IP addresses corresponding to the servers and applications associated with a remote metric agent 407 that does not support distributed health check.
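The flashback handling described above can be sketched as follows. The metric names in the base list and the function names are hypothetical; the sketch takes from the description only that flashback is enabled when no remote metric agent supports distributed health checks and that, when enabled, it is placed just prior to the least response metric.

```python
from typing import Iterable, List


def flashback_enabled(agent_supports_distributed: Iterable[bool]) -> bool:
    """Enable flashback only if no agent supports distributed health checks."""
    return not any(agent_supports_distributed)


def metric_order(base: List[str], use_flashback: bool) -> List[str]:
    """Return the metric order, inserting flashback before least response.

    `base` is a hypothetical ordered list of metric names. If it has no
    "least response" entry, flashback is appended at the end (an assumption).
    """
    order = [m for m in base if m != "flashback"]
    if use_flashback:
        if "least response" in order:
            idx = order.index("least response")
        else:
            idx = len(order)
        order.insert(idx, "flashback")
    return order
```

With a mixed set of agents, `flashback_enabled` returns `False` and the metric order is left without a flashback entry, which corresponds to the disabled case discussed earlier.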

All of the above U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet, are incorporated herein by reference, in their entirety.

The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention and can be made without deviating from the spirit and scope of the invention.

These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation. 

What is claimed is:
 1. A system comprising: a first network device to site switch, adapted to collect health check information indicative of access conditions to components for which the first network device performs switching, wherein the collection of health check information occurs independently of a request; and a metric agent, at the first network device, adapted to communicate the collected health check information to a second network device to load balance, the communication occurring a plurality of times over a persistent connection between the first network device and the second network device, wherein the persistent connection comprises a single TCP connection initiated by the second network device; and wherein the health check information comprises information indicative of which ports of one or more host servers coupled to the first network device are up or down and the health status of applications hosted on the one or more host servers.
 2. The system of claim 1, wherein the health check information further comprises information indicative of access conditions to an application hosted on the one or more host servers.
 3. The system of claim 1, further comprising a plurality of other network devices to site switch that are not adapted to support distributed health check information collection, wherein the plurality of other network devices to site switch are identifiable by the second network device to enable the second network device to process their respective health check information.
 4. The system of claim 1, further comprising a third network device to load balance that is not enabled to support distributed health check information collection by its corresponding network devices to site switch, wherein the corresponding network devices to site switch are coupled to respond to health check queries from the third network device and to modify other information into a format that is adapted to be processed by the third network device.
 5. The system of claim 1, wherein a time interval to send the collected health check information from the first network device to the second network device is specified globally for the first network device at the second network device.
 6. The system of claim 1, wherein the metric agent at the first network device is adapted to communicate the health check information to the second network device independent of a query for the health check information from the second network device.
 7. A method of providing load balancing, the method comprising: establishing a persistent connection between a first network device to load balance and at least one second network device to site switch, the at least one second network device remote from the first network device, wherein the persistent connection comprises a single TCP connection initiated by the first network device; receiving a plurality of times, at the first network device through the persistent connection, health check information collected at the at least one second network device and indicative of access conditions to respective host servers for which the at least one second network device performs switching, wherein the collection of health check information occurs independently of a request; arranging, at the first network device, network addresses in accordance with a set of performance metrics that include the health check information collected by and received from the at least one second network device; and disabling a flashback metric, from among the set of performance metrics, indicative of a time to respond to a health check request sent by the first network device.
 8. The method of claim 7, wherein the health check information includes addresses associated with each of the at least one second network device and status indicative of which ports corresponding to the addresses are up or are down, the addresses and status being present in a message sent from each of the at least one second network device, wherein receiving the health check information collected by the at least one second network device through the persistent connection comprises: maintaining the persistent connection between the first network device and each of the at least one second network device using a keep-alive message type; receiving, from each of the at least one second network device, the message via each persistent connection and independently of a query for the message from the first network device; and receiving, from each of the at least one second network device and independently of a query for the message from the first network device, an update to the health check information in an additional message after a specified time interval, and including information indicative of a change in addresses or ports at the each of the at least one second network device.
 9. The method of claim 7, further comprising identifying other network devices to site switch that collect health check information in response to requests from the first network device.
 10. The method of claim 7, further comprising: specifying, at the first network device and globally for all of the at least one second network device, a time interval to provide the health check information to the first network device; or specifying, individually for each of the at least one second network device, the time interval to be used by that specific second network device to provide the health check information to the first network device.
 11. The method of claim 7, wherein the received health check information, collected at the at least one second network device, includes information indicative of access to applications hosted at the host servers.
 12. The method of claim 7, wherein the network addresses include virtual IP addresses, at least one of said virtual IP addresses being configured at the at least one second network device and corresponding to at least one of the host servers of the at least one second network device, wherein the first network device is adapted to perform said arranging to balance traffic between a plurality of the at least one second network device.
 13. An article of manufacture, comprising: a non-transitory storage medium having instructions stored thereon that are executable by a first network device to load balance, to: process health check information remotely collected by at least one of a plurality of second network devices to site switch, the health check information indicative of access conditions to respective host servers for which at least one of the second network devices performs switching, wherein the collection of health check information occurs independently of a request; and arrange network addresses in accordance with a set of performance metrics that include the health check information, wherein one of the performance metrics includes a flashback metric representing a speed to respond to a request from the first network device for health check information, the flashback metric being disabled based on the health check information remotely collected a plurality of times over a persistent connection by at least one of the second network devices, wherein the persistent connection comprises a single TCP connection initiated by the first network device.
 14. The article of manufacture of claim 13, wherein the network addresses include virtual IP addresses, at least one of said virtual IP addresses being configured at each respective second network device and each of the virtual IP addresses corresponding to at least one of the host servers of the respective second network device, wherein the first network device is adapted to arrange network addresses.
 15. The article of manufacture of claim 13, wherein the health check information is received by the first network device through a persistent connection to each of the second network devices, wherein the persistent connection comprises a single TCP connection initiated by the first network device.
 16. The article of manufacture of claim 15, wherein the persistent connection uses a keep-alive message type to maintain persistency to enable the health check information to be conveyed to the first network device independently of a query for the health check information by the first network device, and without having to establish a new connection to separately convey health check information from the at least one of the second network devices.
 17. An article of manufacture, comprising: a non-transitory storage medium having instructions stored thereon that are executable by a first network device to load balance, to: establish a persistent connection between the first network device and at least one of a plurality of second network devices to site switch remote from the first network device, wherein the persistent connection comprises a single TCP connection initiated by the first network device; process health check information remotely collected by the at least one of the plurality of second network devices and indicative of access conditions to respective host servers for which the at least one of the plurality of second network devices perform switching, the health check information being received a plurality of times by the first network device through the persistent connection between the at least one of the plurality of second network devices, and the collection of health check information occurring independently of a request; and arrange network addresses in accordance with a set of performance metrics that include the health check information, wherein the health check information includes addresses associated with each of the plurality of second network devices and includes indications of which ports associated with each of the addresses are up or are down.
 18. The article of manufacture of claim 17, wherein a time interval for the at least one of the plurality of second network devices to provide the health check information is individually specified for each of the at least one of the plurality of second network devices.
 19. A system comprising: a first network device to site switch, adapted to receive health check information, the health check information being indicative of access conditions to components for which the first network device collects health check information and for which the first network device performs switching, wherein the collection of health check information occurs independently of a request; and a metric agent, at the first network device, adapted to communicate the health check information to a second network device to load balance, the communication occurring a plurality of times using a persistent connection between the first network device and the second network device, wherein the persistent connection comprises a single TCP connection initiated by the second network device; and wherein the first network device is adapted to be communicatively coupled to the second network device and is adapted to convey the health check information a plurality of times on the persistent connection to the second network device as part of a keep-alive message, wherein the health check information comprises information indicative of which ports of one or more host servers coupled to the first network device are up or down and the health status of applications hosted on the one or more host servers.
 20. A system to balance load, the system comprising: a first network device to site switch, to collect health check information indicative of access conditions to components for which the first network device performs switching and the health status of applications hosted on one or more servers coupled to the first network device, wherein the collection of health check information occurs independently of a request; and a metric agent, at the first network device, to communicate the collected health check information to a second network device to load balance, the communication occurring a plurality of times using a persistent connection between the first network device and the second network device, wherein the persistent connection comprises a single TCP connection initiated by the second network device; and wherein a flashback metric representing a time to respond to a query for the health check information is disabled based on the collection of the health check information distributed from the second network device to the first network device. 